In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import matplotlib.pyplot as plt

Introduction to Tethne: Words and Topic Modeling

In this workbook we'll start working with word-based FeatureSets, and use the mallet module from tethne.model.corpus to fit a Latent Dirichlet Allocation (LDA) topic model.

Before you start

  • Download the practice dataset from here, and store it in a place where you can find it. You'll need the full path to your dataset.

Loading JSTOR DfR datasets

First, import the dfr module from tethne.readers.


In [3]:
from tethne.readers import dfr

Unlike WoS datasets, DfR datasets can contain wordcounts, bigrams, trigrams, and quadgrams in addition to bibliographic data. read will automagically load those data as FeatureSets.


In [4]:
corpus = dfr.read('/Users/erickpeirson/Projects/tethne-notebooks/data/dfr')

In [5]:
print 'There are %i papers in this corpus.' % len(corpus.papers)


There are 241 papers in this corpus.

In [6]:
print 'This corpus contains the following features: \n\t%s' % '\n\t'.join(corpus.features.keys())


This corpus contains the following features: 
	citations
	wordcounts
	authors

Whereas Corpora generated from WoS datasets are indexed by wosid, Corpora generated from DfR datasets are indexed by doi.


In [7]:
corpus.indexed_papers.keys()[0:10]    # The first 10 dois in the Paper index.


Out[7]:
['10.2307/2418718',
 '10.2307/2258178',
 '10.2307/3241549',
 '10.2307/2416998',
 '10.2307/20000814',
 '10.2307/2428935',
 '10.2307/2418714',
 '10.2307/1729159',
 '10.2307/2407516',
 '10.2307/2816048']

So you can retrieve specific Papers from the Corpus using their DOIs:


In [8]:
corpus['10.2307/2418718']


Out[8]:
<tethne.classes.paper.Paper at 0x1075f42d0>

Working with featuresets

FeatureSets are stored in the features attribute of a Corpus object. Corpus.features is just a dict, so you can see which featuresets are available by calling its keys method.


In [9]:
print 'This corpus contains the following featuresets: \n\t{0}'.format(
            '\n\t'.join(corpus.features.keys())    )


This corpus contains the following featuresets: 
	citations
	wordcounts
	authors

It's not uncommon for featuresets to be incomplete; sometimes data won't be available for all of the Papers in the Corpus. An easy way to check the number of Papers for which data is available in a FeatureSet is to get the size (len) of the features attribute of the FeatureSet. For example:


In [10]:
print 'There are %i papers with unigram features in this corpus.' % len(corpus.features['wordcounts'].features)


There are 241 papers with unigram features in this corpus.

To check the number of features in a featureset (e.g. the size of a vocabulary), look at the size of the index attribute.


In [11]:
print 'There are %i words in the unigrams featureset' % len(corpus.features['wordcounts'].index)


There are 52207 words in the unigrams featureset

index maps integer identifiers to words. Here are the first ten entries in the FeatureSet's index:


In [12]:
corpus.features['wordcounts'].index.items()[0:10]


Out[12]:
[(0, 'dynamic'),
 (1, 'relationships'),
 (2, 'calculate'),
 (3, 'physiological'),
 (4, 'segments'),
 (5, 'under'),
 (6, 'emlen'),
 (7, 'worth'),
 (8, 'experimentally'),
 (9, 'alpine')]

Applying a stoplist

In many cases you may wish to apply a stoplist (a list of features to exclude from analysis) to a featureset. In our DfR dataset, the most common words are prepositions and other terms that don't really have anything to do with the topical content of the corpus.


In [13]:
corpus.features['wordcounts'].top(5)    # The top 5 words in the FeatureSet.


Out[13]:
[('the', 74619.0),
 ('of', 67372.0),
 ('and', 42712.0),
 ('in', 36453.0),
 ('a', 22590.0)]
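
To get a feel for the bookkeeping behind index and top(), here's a minimal sketch in plain Python. It does not use Tethne, and it is not how Tethne is implemented; the toy documents and the names docs, index, totals, and top are made up for illustration.

```python
from collections import Counter

# Toy "featureset": each document has its own word counts (hypothetical data).
docs = {
    'doc1': Counter({'the': 5, 'plant': 2, 'alpine': 1}),
    'doc2': Counter({'the': 3, 'of': 4, 'plant': 1}),
    'doc3': Counter({'of': 2, 'the': 1}),
}

# Build an integer -> word index over the whole vocabulary.
vocabulary = sorted(set(word for counts in docs.values() for word in counts))
index = dict(enumerate(vocabulary))

# Sum the per-document counts to get corpus-wide totals.
totals = Counter()
for counts in docs.values():
    totals.update(counts)


def top(n):
    """Return the n most frequent words with their total counts."""
    return totals.most_common(n)


print(len(index))   # size of the vocabulary
print(top(2))       # the two most frequent words
```

As in the real dataset, the most frequent words in the toy corpus are function words, which is why a stoplist is usually the next step.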

You may apply any stoplist you like to a featureset. In this example, we import the Natural Language ToolKit (NLTK) stopwords corpus.


In [3]:
from nltk.corpus import stopwords
stoplist = stopwords.words()

We then create a function that will evaluate whether or not a word is in our stoplist. The function should take four arguments:

  • f -- the feature itself (the word)
  • v -- the number of instances of that feature in a specific document
  • c -- the number of instances of that feature in the whole FeatureSet
  • dc -- the number of documents that contain that feature

This function will be applied to each word in each document. If it returns 0 or None, the word will be excluded. Otherwise, it should return a numeric value (in this case, the count for that document).

In addition to applying the stoplist, we'll also exclude any word that occurs in more than 50 documents or in fewer than 3 documents.


In [17]:
def apply_stoplist(f, v, c, dc):
    if f in stoplist or dc > 50 or dc < 3:
        return 0
    return v

We apply the stoplist using the transform() method. FeatureSets are not modified in place; instead, a new FeatureSet is generated that reflects the specified changes. We'll call the new FeatureSet 'wordcounts_filtered'.


In [18]:
corpus.features['wordcounts_filtered'] = corpus.features['wordcounts'].transform(apply_stoplist)

In [19]:
print 'There are %i words in the wordcounts_filtered FeatureSet' % len(corpus.features['wordcounts_filtered'].index)


There are 12204 words in the wordcounts_filtered FeatureSet
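
If the (f, v, c, dc) contract is still a bit abstract, the sketch below mimics it in plain Python on a toy dict-of-dicts. This is an illustration under stated assumptions, not Tethne's transform() itself: the transform function, the toy data, toy_stoplist, and toy_filter are all hypothetical.

```python
def transform(featureset, func):
    """Return a new dict-of-dicts featureset; the input is left untouched."""
    # c: total count of each feature; dc: number of documents containing it.
    c, dc = {}, {}
    for counts in featureset.values():
        for f, v in counts.items():
            c[f] = c.get(f, 0) + v
            dc[f] = dc.get(f, 0) + 1

    result = {}
    for doc, counts in featureset.items():
        kept = {}
        for f, v in counts.items():
            new_v = func(f, v, c[f], dc[f])
            if new_v:               # 0 or None means "exclude this word"
                kept[f] = new_v
        result[doc] = kept
    return result


toy = {
    'doc1': {'the': 5, 'plant': 2, 'alpine': 1},
    'doc2': {'the': 3, 'plant': 1},
    'doc3': {'the': 1, 'plant': 4, 'alpine': 2},
}
toy_stoplist = {'the'}


def toy_filter(f, v, c, dc):
    # Drop stopwords and words found in fewer than 3 documents.
    if f in toy_stoplist or dc < 3:
        return 0
    return v


filtered = transform(toy, toy_filter)
print(filtered)
```

Here 'the' is removed by the stoplist and 'alpine' by the document-count rule, while the original toy dict keeps all of its entries.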

Topic modeling with featuresets

Latent Dirichlet Allocation is a popular approach to discovering latent "topics" in large corpora. Many digital humanists use a software package called MALLET to fit LDA to text data. Tethne uses MALLET to fit LDA topic models.

Start by importing the mallet module from the tethne.model.corpus subpackage.


In [4]:
from tethne.model.corpus import mallet

Now we'll create a new LDAModel for our Corpus. The featureset_name parameter tells the LDAModel which FeatureSet we want to use. We'll use our filtered wordcounts.


In [21]:
model = mallet.LDAModel(corpus, featureset_name='wordcounts_filtered')

Next we'll fit the model. We need to tell MALLET how many topics to fit (the hyperparameter Z), and how many iterations (max_iter) to perform. This step may take a little while, depending on the size of your corpus.


In [47]:
model.fit(Z=50, max_iter=500)

You can inspect the inferred topics using the model's print_topics() method. By default, this will print the top ten words for each topic.


In [48]:
model.print_topics()


Topic	Top 10 words
0  	chromosomes tetraploid sterile diploid race triploid caespitosa vigorous african pairing
1  	spruce tn black columbia cd dormancy provenances elevation picea boreal
2  	pine scandinavia norway external november altitudes statement dass sweden fur
3  	islands seq island fen isles woodland limestone fern spores floras
4  	fundamental brought genetical real led simply idea purely generation put
5  	terrestrial aquatic width blade heterophylly hidden inflorescences horse flexibility petiole
6  	typha latifolia domingensis shoot shoots chlorophyll night productivity photoperiod angustifolia
7  	drought gunnera km mm borealis locality moisture san wet eucalyptus
8  	public program social programs policy aaas education scientists conservation professional
9  	ecotype turesson cultivated hereditary hiesey ecospecies clausen gregor plantago subspecies
10 	illustrations volumes editors written copies edited editor contributors price pages
11 	nitrogen metabolism fixation fungi mycorrhizal uptake ultrastructure carbon ustilago mycorrhizas
12 	subsp rotundifolia district tetraploids campanula tetraploid mts purpureum mm diploids
13 	gradient pine elevation gradients moisture quercus climax whittaker dominant associations
14 	models fitness effort strategy strategies mortality mathematical adaptations model spatial
15 	pollination inbreeding graminea floral selfing insect insects outbreeding road wind
16 	facts dealing definition taxonomist places difficulties floras taxonomists definitions refer
17 	session meeting joint union program dept pm social aibs william
18 	teaching professor experience school sciences curriculum opportunity employer send massachusetts
19 	herbarium zealand mm leaflets stems cultures compound rosa varieties sepals
20 	today behavior unique developments numerical outstanding aspect proceedings concern precise
21 	service electron circle medical write nuclear box applications card april
22 	committee organization ecologists article training council biosystematics journals meeting descriptive
23 	tillers ns greenland vernalization quadrats island alpinum inflorescence georgia tiller
24 	greenhouse dormancy alpestre photoperiod hr germinated performance elevations bow medicine
25 	clones transplant clone clonal phenological colorado transplanted flowered texas crown
26 	stands forests elevations ft shrub mesic heath slopes herb stems
27 	illus pages edited chemistry edition physics eds nov jan july
28 	montane vulgaris berkeley sierran inland prunella race californian mortality cascade
29 	jour jan nov feb oct mar dec dee canad rio
30 	warm cool acclimation respiration dark photosynthetic regime net thermal photosynthesis
31 	room michigan indiana hall building illinois bldg cornell md texas
32 	agrostis festuca tenuis lolium perenne jenkin crosses stolonifera cent pasture
33 	efficiency model resistance calculated convection transpiration variables stomatal air optimal
34 	desert poa prairie woody subtropical communis sp class phragmites vascular
35 	disruptive mating divergence subspecies drosophila ssp heredity polymorphism thoday migration
36 	marsh salt graminea marshes sandy alterniflora sagittaria puccinellia aquatic tall
37 	abundant striking fairly bearing majority failure difficulty typical modified employed
38 	chapters chapter devoted discipline reviewed elementary edition autecology textbook palynology
39 	lawn prostrate erect plantago cn lawns wt poa ns bowling
40 	darwin cultivated harrison domestic genes crossing varieties inheritance induced domestication
41 	cape snow december january net antarctic biomass moss photosynthesis air
42 	william dept david richard george james charles bs ms thomas
43 	ecosystem rain productivity ecosystems forests ecologists nutrient succession tropics management
44 	component error sampled estimated expression correlations location trend reflect achieved
45 	plots park ca odoratum contrasting calcareous trial calcium trials anthoxanthum
46 	mr nepal wheat professor barley pakistan accessions wind fertile productivity
47 	arctic relict glacial relicts lakes century botanists stable ice fauna
48 	flushing chilling bud frost burst regression traits variables oregon provenance
49 	speciation biosystematic stebbins publ biosystematics polyploids chromosomal barriers formal polyploidy
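
A listing like the one above boils down to ranking words by their probability within each topic. Here's a minimal sketch of that ranking in plain Python; the topic-word matrix phi, the vocabulary, and the helper top_words are invented for illustration and are not part of Tethne or MALLET.

```python
def top_words(phi_row, vocabulary, n=10):
    """Return the n words with the highest probability in one topic."""
    ranked = sorted(zip(phi_row, vocabulary), reverse=True)
    return [word for _, word in ranked[:n]]


vocabulary = ['pine', 'spruce', 'marsh', 'salt']
# One row per topic, one column per word (hypothetical probabilities).
phi = [
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.10, 0.25, 0.60],
]

for k, row in enumerate(phi):
    print('{0}\t{1}'.format(k, ' '.join(top_words(row, vocabulary, n=2))))
```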

We can also look at the representation of a topic over time using the topic_over_time() method. In the example below we'll plot the first five topics on the same axes.


In [49]:
plt.figure(figsize=(15, 5))
for k in xrange(5):
    x, y = model.topic_over_time(k)
    plt.plot(x, y, label='topic {0}'.format(k), lw=2, alpha=0.7)
plt.legend(loc='best')
plt.show()
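
One simple way to think about a topic's representation over time is as the average proportion of that topic among the documents published in each year. The sketch below implements that reading in plain Python. Whether Tethne computes its values in exactly this way is an assumption, and theta, years, and this stand-alone topic_over_time function are hypothetical.

```python
from collections import defaultdict


def topic_over_time(theta, years, k):
    """Average the proportion of topic k over the documents in each year."""
    by_year = defaultdict(list)
    for doc_topics, year in zip(theta, years):
        by_year[year].append(doc_topics[k])
    xs = sorted(by_year)
    ys = [sum(by_year[x]) / float(len(by_year[x])) for x in xs]
    return xs, ys


# One row per document, one column per topic (hypothetical proportions).
theta = [
    [0.75, 0.25],
    [0.25, 0.75],
    [0.125, 0.875],
]
years = [1950, 1950, 1951]

x, y = topic_over_time(theta, years, 0)
print(x)
print(y)
```

The returned x and y lists have the same shape as the values passed to plt.plot in the cell above.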


Generating networks from topic models

The topics module in the tethne.networks subpackage contains some useful methods for visualizing topic models as networks. You can import it just like the authors or papers modules.


In [5]:
from tethne.networks import topics

In [51]:
termGraph = topics.terms(model, threshold=0.01)

In [66]:
termGraph.name = ''

The terms function generates a network of words connected on the basis of shared affinity with a topic. If two words i and j are both associated with a topic z with $\Phi(i|z) \geq 0.01$ and $\Phi(j|z) \geq 0.01$ (0.01 being the threshold used above), then an edge is drawn between them.

The resulting graph will be smaller or larger depending on the value that you choose for threshold. You may wish to increase or decrease threshold to achieve something interpretable.
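
The thresholding rule described above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not Tethne's implementation: the function term_edges, the toy matrix phi, and the vocabulary are assumptions, and edges are kept in a dict rather than a graph object.

```python
from itertools import combinations


def term_edges(phi, vocabulary, threshold=0.01):
    """Map each word pair to the topics in which both words pass the threshold."""
    edges = {}
    for z, row in enumerate(phi):
        strong = [w for w, p in zip(vocabulary, row) if p >= threshold]
        for i, j in combinations(sorted(strong), 2):
            edges.setdefault((i, j), []).append(z)
    return edges


vocabulary = ['marsh', 'pine', 'salt', 'spruce']
# One row per topic, one column per word (hypothetical probabilities).
phi = [
    [0.05, 0.45, 0.05, 0.45],   # topic 0: mostly pine and spruce
    [0.45, 0.05, 0.45, 0.05],   # topic 1: mostly marsh and salt
]

edges = term_edges(phi, vocabulary, threshold=0.4)
print(edges)
```

With threshold=0.4 the toy graph has two edges, one per topic. Lowering the threshold to 0.05 lets every word through in both topics, so all six possible word pairs become connected, which mirrors how the real graph grows as the threshold drops.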


In [67]:
print 'There are {0} nodes and {1} edges in this graph.'.format(
            len(termGraph.nodes()), len(termGraph.edges()))


There are 471 nodes and 3155 edges in this graph.

The resulting Graph can be written to GraphML just like any other Graph.


In [37]:
from tethne.writers import graph

In [69]:
graphpath = '/Users/erickpeirson/Projects/tethne-notebooks/output/lda.graphml'
graph.write_graphml(termGraph, graphpath)

The network visualization below was generated in Cytoscape. Edge width is a function of the 'weight' attribute. Edge color is based on the 'topics' attribute, to give some sense of which clusters of terms belong to which topics. We can see right away that terms like plants, populations, species, and selection are all very central to the topics retrieved by this model.

WoS abstracts

JSTOR DfR is not the only source of wordcounts with which we can perform topic modeling. For records more recent than about 1990, the Web of Science includes abstracts in its bibliographic records.

Let's first spin up our WoS dataset.


In [6]:
from tethne.readers import wos
wosCorpus = wos.read('/Users/erickpeirson/Projects/tethne-notebooks/data/wos')

Here's what one of the abstracts looks like:


In [7]:
wosCorpus[0].abstract


Out[7]:
u'Demographic models are powerful tools for making predictions about the relative importance of transitions from one life stage (e. g., seeds) to another (e. g., nonreproductives); however, they have never been used to compare the relative performance of invasive and noninvasive taxa. I use demographic models parameterized from common garden experiments to develop hypotheses about the role of different life stage transitions in determining differences in performance in invasive and noninvasive congeners in the Commelinaceae. I also extended nested life table response experiment (LTRE) analyses to accommodate interactions between nested and unnested factors. Invasive species outperformed their noninvasive congeners, especially under high-nutrient conditions. This difference in performance did not appear to be due to differences in elasticities of vital rates, but rather to differences in the magnitude of stage transitions. Self-compatible invasive species had greater fecundity in high-nutrient environments and a shorter time to first reproduction, and all invasive species had greater vegetative reproduction than their noninvasive congeners. Thus greater opportunism in sexual and asexual reproduction explained the greater performance of invasive species under high-nutrient conditions. Similar common garden experiments could become a useful tool to predict potential invaders from pools of potential introductions. I show that short-term and controlled experiments considering multiple nutrient environments may accurately predict invasiveness of nonnative plant species. C1 Washington Univ, Tyson Res Ctr, Dept Biol, St Louis, MO 63130 USA.'

The index_feature method converts all of the available abstracts in our Corpus to a featureset. We pass it the name of the field to index ('abstract') and a tokenize function, which dices each abstract up into its constituent words, along with structured=True. The result is a featureset called 'abstract'.


In [8]:
from tethne import tokenize
wosCorpus.index_feature('abstract', tokenize=tokenize, structured=True)
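
To see what dicing an abstract into words amounts to, here's a minimal stand-in tokenizer in plain Python. It is a sketch only: simple_tokenize and count_tokens are hypothetical helpers, and Tethne's own tokenizer may well handle punctuation, numbers, or case differently.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split it into runs of letters, in order."""
    return re.findall(r'[a-z]+', text.lower())


def count_tokens(text):
    """Count how many times each token occurs in the text."""
    counts = {}
    for token in simple_tokenize(text):
        counts[token] = counts.get(token, 0) + 1
    return counts


abstract = 'Invasive species outperformed their noninvasive congeners.'
tokens = simple_tokenize(abstract)
print(tokens)
print(count_tokens('Common garden experiments; common tools.'))
```

The ordered token list is the kind of raw material a featureset can be built from, and the counts correspond to the per-document values that functions like our stoplist filter receive.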

Sure enough, here's our 'abstract' featureset:


In [9]:
wosCorpus.features.keys()


Out[9]:
['abstract', 'citations', 'authors']

Since we're not working from OCR texts (that's where JSTOR DfR data comes from), there are far fewer "junk" words. We end up with a much smaller vocabulary.


In [10]:
print 'There are {0} features in the abstract featureset.'.format(len(wosCorpus.features['abstract'].index))


There are 23732 features in the abstract featureset.

But since not all of our WoS records date from 1990 or later, there are a handful for which there are no abstract terms.


In [11]:
print 'Only {0} of {1} papers have abstracts, however.'.format(len(wosCorpus.features['abstract'].features), len(wosCorpus.papers))


Only 1778 of 1859 papers have abstracts, however.

In [12]:
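# Keep words that are not in the stoplist and that occur in more than 2 but fewer than 400 documents.
abstract_filter = lambda f, v, c, dc: f not in stoplist and 2 < dc < 400
wosCorpus.features['abstract_filtered'] = wosCorpus.features['abstract'].transform(abstract_filter)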

In [13]:
print 'There are {0} features in the abstract_filtered featureset.'.format(len(wosCorpus.features['abstract_filtered'].index))


There are 23732 features in the abstract_filtered featureset.

In [14]:
type(wosCorpus.features['abstract_filtered'])


Out[14]:
tethne.classes.feature.StructuredFeatureSet

The 'abstract' featureset is similar to the 'wordcounts' featureset from the JSTOR DfR dataset, so we can perform topic modeling with it, too.


In [15]:
wosModel = mallet.LDAModel(wosCorpus, featureset_name='abstract_filtered')

In [18]:
wosModel.fit(Z=50, max_iter=500)

In [19]:
wosModel.print_topics(Nwords=5)


Topic	Top 10 words
0  	traits selection variation genetic trait
1  	species garden common areas distribution
2  	usa univ ca dept calif
3  	forest canada pine red bc
4  	important fungal found dept disease
5  	gene garden flow spatial distance
6  	larvae food larval predation common
7  	seedlings trees height survival high
8  	life time development history age
9  	variation differences morphological significant observed
10 	univ sci dept ctr res
11 	seed seeds germination source seedling
12 	china sci mexico univ lab
13 	individuals garden univ dept common
14 	size body mass larger populations
15 	light ecotypes canopy ciencias high
16 	genotypes genetic variation community usa
17 	environmental conditions studies results factors
18 	effects effect plant interactions interaction
19 	france umr observed expression genes
20 	water drought lower dry higher
21 	inst univ switzerland dept biol
22 	reproductive females reproduction sexual males
23 	plasticity phenotypic divergence populations differences
24 	levels clones higher study significant
25 	leaf area leaves correlated photosynthetic
26 	root elsevier reserved rights low
27 	increased conditions response responses change
28 	usa univ dept biol state
29 	plants plant flowering competition production
30 	native invasive range introduced plants
31 	herbivores resistance plant plants damage
32 	season phenology spring cold growing
33 	wild fish survival reaction salmon
34 	temperature degrees northern southern temperatures
35 	growth rate rates higher differences
36 	local adaptation habitats habitat environments
37 	geographic variation data variables region
38 	univ ecol plant dept management
39 	patterns pattern consistent tests group
40 	australia univ sci sch biol
41 	populations population genetic differentiation diversity
42 	fitness hybrid hybrids offspring maternal
43 	soil litter nutrient decomposition quality
44 	metabolic birds physiological differences studies
45 	genetic analysis variation markers molecular
46 	sites site field common garden
47 	forest species tree soil stands
48 	high experiments low study due
49 	tolerance dept biol stress usa

topics.cotopics() creates a network of topics, linked by virtue of their co-occurrence in documents. Use the threshold parameter to tune the density of the graph.


In [39]:
coTopicGraph = topics.cotopics(wosModel, threshold=0.15)

In [44]:
print '%i nodes and %i edges' % (coTopicGraph.order(), coTopicGraph.size())


45 nodes and 95 edges
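
Here's a rough sketch of the idea behind a co-topic network, in plain Python. It assumes that two topics co-occur in a document when both have a proportion of at least threshold in that document, and it weights each edge by the number of such documents. That reading, the function cotopic_edges, and the toy matrix theta are assumptions for illustration; Tethne may define and normalize co-occurrence differently.

```python
from itertools import combinations


def cotopic_edges(theta, threshold=0.15):
    """Count, for each topic pair, the documents in which both are prominent."""
    edges = {}
    for doc_topics in theta:
        present = [k for k, p in enumerate(doc_topics) if p >= threshold]
        for pair in combinations(present, 2):
            edges[pair] = edges.get(pair, 0) + 1
    return edges


# One row per document, one column per topic (hypothetical proportions).
theta = [
    [0.5, 0.4, 0.1],
    [0.45, 0.45, 0.1],
    [0.1, 0.3, 0.6],
]

edges = cotopic_edges(theta, threshold=0.15)
print(edges)
```

Topics that never reach the threshold alongside another topic get no edges at all, which is one way a graph can end up with fewer nodes than there are topics in the model.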

In [41]:
graph.write_graphml(coTopicGraph, '/Users/erickpeirson/Projects/tethne-notebooks/output/lda_coTopics.graphml')

topics.topic_coupling() creates a network of documents, linked by virtue of containing shared topics. Again, use the threshold parameter to tune the density of the graph.


In [50]:
topicCoupling = topics.topic_coupling(wosModel, threshold=0.2)

In [51]:
print '%i nodes and %i edges' % (topicCoupling.order(), topicCoupling.size())


716 nodes and 7685 edges
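
The document-level counterpart can be sketched the same way. In the plain-Python example below, two documents are linked when at least one topic has a proportion of at least threshold in both of them. This is an assumed reading of "shared topics", not a description of Tethne's internals, and topic_coupling_edges and theta are hypothetical.

```python
from itertools import combinations


def topic_coupling_edges(theta, threshold=0.2):
    """Link document pairs that share at least one prominent topic."""
    prominent = [set(k for k, p in enumerate(row) if p >= threshold)
                 for row in theta]
    edges = {}
    for a, b in combinations(range(len(theta)), 2):
        shared = prominent[a] & prominent[b]
        if shared:
            edges[(a, b)] = sorted(shared)
    return edges


# One row per document, one column per topic (hypothetical proportions).
theta = [
    [0.7, 0.2, 0.1],
    [0.6, 0.1, 0.3],
    [0.1, 0.1, 0.8],
]

edges = topic_coupling_edges(theta, threshold=0.2)
print(edges)
```

Documents 0 and 1 are coupled through topic 0, and documents 1 and 2 through topic 2. Raising the threshold prunes edges quickly, since fewer topics count as prominent in any one document.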

In [52]:
graph.write_graphml(topicCoupling, '/Users/erickpeirson/Projects/tethne-notebooks/output/lda_topicCoupling.graphml')

